Add ScalarValue::RunEndEncoded variant#19895
Conversation
| /// (run-ends field, value field, value) | ||
| RunEndEncoded(FieldRef, FieldRef, Box<ScalarValue>), |
There was a problem hiding this comment.
Mimicking the arrow type where it stores fields:
I tried initially only storing the index DataType and the ScalarValue value, but figured it would be better to try be as accurate as possible 🤔
| | DataType::Decimal32(_, _) | ||
| | DataType::Decimal64(_, _) |
There was a problem hiding this comment.
Little fix since we were missing these
|
|
||
| // Unsupported types for now | ||
| _ => { | ||
| DataType::ListView(_) | DataType::LargeListView(_) => { |
There was a problem hiding this comment.
Just getting rid of the catch-all to be more rigorous
| _ => unreachable!("Invalid dictionary keys type: {}", key_type), | ||
| } | ||
| } | ||
| DataType::RunEndEncoded(run_ends_field, value_field) => { |
There was a problem hiding this comment.
We're building the runarray efficiently here, as unlike dictionary above which would require keeping a hashmap of values to build an efficient dictionary array, run arrays are simpler in that we just need to track when a new run starts.
Most of the verbosity here is related to destructuring input ScalarValues and ensuring we have consistent types from them.
| let run_ends = PrimitiveArray::<R>::from_iter_values(run_ends); | ||
| let values = ScalarValue::iter_to_array(value_scalars)?; | ||
|
|
||
| // Using ArrayDataBuilder so we can maintain the fields |
There was a problem hiding this comment.
I think this is the only way to construct runarrays with fields we want, since try_new creates the fields for us:
| ); | ||
| let err = scalar.eq_array(&run_array, 1).unwrap_err(); | ||
| let expected = "Internal error: could not cast array of type Float32 to arrow_array::array::primitive_array::PrimitiveArray<arrow_array::types::Float64Type>"; | ||
| assert!(err.to_string().starts_with(expected)); |
There was a problem hiding this comment.
Needed to use starts_with since backtrace feature can affect the error message, so direct equality can succeed for cargo test but fail in CI
| return Err(Error::General( | ||
| "Proto serialization error: The RunEndEncoded data type is not yet supported".to_owned() | ||
| )) | ||
| DataType::Decimal32(precision, scale) => { |
There was a problem hiding this comment.
Only change here is for RunEndEncoded; for some reason other formatting changes were applied for the other arms here
|
FYI @brancz |
| } | ||
| (Dictionary(_, _), _) => None, | ||
| (RunEndEncoded(rf1, vf1, v1), RunEndEncoded(rf2, vf2, v2)) => { | ||
| // Don't compare if the run ends fields don't match (it is effectively a different datatype) |
There was a problem hiding this comment.
I'm not sure this is exactly what we want. The run arrays could be logically identical, but their index types might differ. I don't think we'd want the scalar not to equal in that case. I realize that's not what we have for dictionaries either, but is that really the intention of scalars? My understanding has always been that the integer width of codes should be irrelevant from a logical equality perspective.
There was a problem hiding this comment.
I'm not sure if we want that logic as this level; for example if we fix PartialOrd here to compare REE/Dicts based on inner values only, then we'd probably have to do the same for PartialEq right? But then we run into an issue with Hash not being consistent unless we also fix Hash 🤔
I think it might be better to leave these as is, and if we want proper comparison it would make more sense to do at a high level (e.g. via type coercion)
There was a problem hiding this comment.
I do believe that to be the right path ultimately, but I can agree to starting with this and consistently change things as a follow up.
There was a problem hiding this comment.
Let's file a follow on ticket.
There was a problem hiding this comment.
|
Nice! |
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#18563 ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Support RunEndEncoded scalar values, similar to how we support for Dictionary. ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> - Add new `ScalarValue::RunEndEncoded` enum variant - Fix `ScalarValue::new_default` to support `Decimal32` and `Decimal64` - Support RunEndEncoded type in proto for both `ScalarValue` message and `ArrowType` message ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Added tests. ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> New variant for `ScalarValue` Protobuf changes to support RunEndEncoded type <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
<!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#18563 <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Support RunEndEncoded scalar values, similar to how we support for Dictionary. <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> - Add new `ScalarValue::RunEndEncoded` enum variant - Fix `ScalarValue::new_default` to support `Decimal32` and `Decimal64` - Support RunEndEncoded type in proto for both `ScalarValue` message and `ArrowType` message <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Added tests. <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> New variant for `ScalarValue` Protobuf changes to support RunEndEncoded type <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Rationale for this change
Support RunEndEncoded scalar values, similar to how we support for Dictionary.
What changes are included in this PR?
ScalarValue::RunEndEncodedenum variantScalarValue::new_defaultto supportDecimal32andDecimal64ScalarValuemessage andArrowTypemessageAre these changes tested?
Added tests.
Are there any user-facing changes?
New variant for
ScalarValueProtobuf changes to support RunEndEncoded type